An Investigation of Monotonic Transducers for Large-Scale Automatic Speech Recognition
The two most popular loss functions for streaming end-to-end automatic speech
recognition (ASR) are the RNN-Transducer (RNN-T) and the connectionist temporal
classification (CTC) objectives. Both perform an alignment-free training by
marginalizing over all possible alignments, but use different transition rules.
Between these two loss types sit the monotonic RNN-T (MonoRNN-T)
and the recently proposed CTC-like Transducer (CTC-T), which both can be
realized using the graph temporal classification-transducer (GTC-T) loss
function. Monotonic transducers have a few advantages. First, RNN-T can suffer
from runaway hallucination, where a model keeps emitting non-blank symbols
without advancing in time, often in an infinite loop. Second, monotonic
transducers consume exactly one model score per time step and are therefore
more compatible and unifiable with traditional FST-based hybrid ASR decoders.
However, the MonoRNN-T so far has been found to have worse accuracy than RNN-T.
This need not be the case, though: by regularizing the training, via
joint LAS training or parameter initialization from RNN-T, both MonoRNN-T and
CTC-T perform as well as, or better than, RNN-T. This is demonstrated for
LibriSpeech and for a large-scale in-house data set.
Comment: Submitted to Interspeech 202
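The runaway-hallucination point can be made concrete: because a monotonic transducer advances by exactly one frame per decision, even a naive greedy decoder is guaranteed to terminate after T steps. Below is a minimal Python sketch with toy scores; this is an illustration of the decoding constraint, not the GTC-T implementation.

```python
def greedy_decode_monotonic(log_probs, blank=0):
    """Greedy decoding for a monotonic transducer (toy scores).

    Every iteration consumes exactly one frame: whether the argmax is
    blank or a label, we advance in time. This rules out the runaway
    loops possible in standard RNN-T, where non-blank emissions do not
    advance time.
    """
    hyp = []
    for frame in log_probs:          # exactly one emission per frame
        k = max(range(len(frame)), key=frame.__getitem__)
        if k != blank:
            hyp.append(k)
    return hyp

# Toy 4-frame, 3-symbol (blank = 0) posterior grid.
scores = [
    [-0.1, -2.0, -3.0],   # blank
    [-2.5, -0.2, -3.0],   # symbol 1
    [-0.1, -2.0, -3.0],   # blank
    [-3.0, -2.0, -0.3],   # symbol 2
]
print(greedy_decode_monotonic(scores))  # [1, 2]
```

The one-score-per-frame property is also what makes these models line up with FST-based hybrid decoders, which likewise consume one acoustic score per time step.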
Factorized Blank Thresholding for Improved Runtime Efficiency of Neural Transducers
We show how factoring the RNN-T's output distribution can significantly
reduce the computation cost and power consumption for on-device ASR inference
with no loss in accuracy. With the rise in popularity of neural-transducer type
models like the RNN-T for on-device ASR, optimizing RNN-T's runtime efficiency
is of great interest. While previous work has primarily focused on the
optimization of RNN-T's acoustic encoder and predictor, this paper focuses
on the joiner. We show that despite being only a small part of RNN-T,
the joiner has a large impact on the overall model's runtime efficiency. We
propose to factorize the joiner into blank and non-blank portions for the
purpose of skipping the more expensive non-blank computation when the blank
probability exceeds a certain threshold. Since the blank probability can be
computed very efficiently and the RNN-T output is dominated by blanks, our
proposed method leads to a 26-30% decoding speed-up and 43-53% reduction in
on-device power consumption, all the while incurring no accuracy degradation
and being relatively simple to implement.
Comment: Submitted to ICASSP 202
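The thresholding idea can be sketched in a few lines. The networks below are toy stand-ins, not the paper's factorized joiner; the point is only the control flow: evaluate the cheap blank score first and skip the expensive non-blank branch whenever it clears the threshold.

```python
import math

def blank_net(enc, pred):
    # toy stand-in for the cheap blank branch: sigmoid of a dot product
    z = sum(e * p for e, p in zip(enc, pred))
    return 1.0 / (1.0 + math.exp(-z))

def label_net(enc, pred):
    # toy stand-in for the "expensive" non-blank branch
    return "label_" + str(int(sum(enc) + sum(pred)) % 5)

def joiner_with_blank_skip(enc, pred, blank_fn, label_fn, threshold=0.9):
    """Factorized blank thresholding, control flow only (toy networks).

    The cheap blank probability is computed first; only when it falls
    below the threshold do we pay for the non-blank computation.
    """
    p_blank = blank_fn(enc, pred)
    if p_blank >= threshold:
        return "blank", p_blank          # skip the expensive branch
    return label_fn(enc, pred), p_blank

print(joiner_with_blank_skip([2.0, 1.0], [1.0, 1.0], blank_net, label_net)[0])
```

Since most frames in RNN-T decoding are blanks, the expensive branch runs only on the minority of frames where a label emission is actually plausible, which is where the reported speed-up comes from.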
Directional Source Separation for Robust Speech Recognition on Smart Glasses
Modern smart glasses leverage advanced audio sensing and machine learning
technologies to offer real-time transcribing and captioning services,
considerably enriching human experiences in daily communications. However, such
systems frequently encounter environmental noise, which degrades speech
recognition and speaker change detection. To
improve voice quality, this work investigates directional source separation
using the multi-microphone array. We first explore multiple beamformers to
assist source separation modeling by strengthening the directional properties
of speech signals. In addition to relying on predetermined beamformers, we
investigate neural beamforming in multi-channel source separation,
demonstrating that automatically learning directional characteristics effectively
improves separation quality. We further compare ASR performance when leveraging
the separated outputs versus the noisy inputs. Our results show that directional source
separation benefits ASR for the wearer but not for the conversation partner.
Lastly, we perform the joint training of the directional source separation and
ASR model, achieving the best overall ASR performance.
Comment: Submitted to ICASSP 202
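As an illustration of the predetermined-beamformer baseline the paper starts from, here is a minimal delay-and-sum sketch in pure Python. It is restricted to integer sample delays; real systems use fractional delays and frequency-domain filtering, and the neural beamformer learns these characteristics instead of fixing them.

```python
def delay_and_sum(mic_signals, advance_samples):
    """Toy delay-and-sum beamformer (integer sample delays only).

    Each microphone signal is advanced by its steering delay and the
    aligned signals are averaged, reinforcing the target direction
    while averaging down uncorrelated noise.
    """
    n = len(mic_signals[0])
    m = len(mic_signals)
    out = []
    for t in range(n):
        acc = 0.0
        for sig, d in zip(mic_signals, advance_samples):
            if 0 <= t + d < n:
                acc += sig[t + d]
        out.append(acc / m)
    return out

# Second mic hears the same impulse one sample later; advancing it by
# one sample aligns the two channels before averaging.
print(delay_and_sum([[0, 1, 0, 0], [0, 0, 1, 0]], [0, 1]))
```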
Taming Unbalanced Training Workloads in Deep Learning with Partial Collective Operations
Load imbalance pervasively exists in distributed deep learning training
systems, either caused by the inherent imbalance in learned tasks or by the
system itself. Traditional synchronous Stochastic Gradient Descent (SGD)
achieves good accuracy for a wide variety of tasks, but relies on global
synchronization to accumulate the gradients at every training step. In this
paper, we propose eager-SGD, which relaxes the global synchronization for
decentralized accumulation. To implement eager-SGD, we propose to use two
partial collectives: solo and majority. With solo allreduce, the faster
processes contribute their gradients eagerly without waiting for the slower
processes, whereas with majority allreduce, at least half of the participants
must contribute gradients before continuing, all without using a central
parameter server. We theoretically prove the convergence of the algorithms and
describe the partial collectives in detail. Experimental results on
load-imbalanced environments (CIFAR-10, ImageNet, and UCF101 datasets) show
that eager-SGD achieves 1.27x speedup over the state-of-the-art synchronous
SGD, without losing accuracy.
Comment: Published in Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'20), pp. 45-61. 202
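The semantics of the two partial collectives can be modeled in a few lines. This is a single-process simulation, not an MPI implementation: `arrived` stands for the gradient vectors that reached the reduction in time, and the scaling choice (divide by the full world size so missing contributions act like zero gradients) is one plausible reading, not necessarily the paper's exact scheme.

```python
def partial_allreduce(arrived, world_size, mode="solo"):
    """Toy model of eager-SGD's solo/majority partial collectives.

    `solo` reduces whatever gradients are present, letting fast
    processes proceed without waiting. `majority` additionally
    requires at least half of the participants to have contributed,
    otherwise it signals the caller to keep waiting.
    """
    k = len(arrived)
    if mode == "majority" and k < (world_size + 1) // 2:
        return None                      # not enough contributors yet
    dim = len(arrived[0])
    total = [sum(g[i] for g in arrived) for i in range(dim)]
    # Dividing by world_size (not k) makes missing gradients act like
    # zeros, keeping the update magnitude comparable to full SGD.
    return [v / world_size for v in total]
```

For example, with 4 workers in `solo` mode, two arrived gradients `[4, 0]` and `[0, 4]` reduce to `[1.0, 1.0]`, while `majority` mode with a single arrival returns `None` and waits.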
SparCML: High-Performance Sparse Communication for Machine Learning
Applying machine learning techniques to the quickly growing data in science
and industry requires highly scalable algorithms. Large datasets are most
commonly processed in a "data parallel" fashion, distributed across many nodes. Each node's
contribution to the overall gradient is summed using a global allreduce. This
allreduce is the single communication and thus scalability bottleneck for most
machine learning workloads. We observe that frequently, many gradient values
are (close to) zero, leading to sparse or sparsifiable communications. To
exploit this insight, we analyze, design, and implement a set of
communication-efficient protocols for sparse input data, in conjunction with
efficient machine learning algorithms which can leverage these primitives. Our
communication protocols generalize standard collective operations, by allowing
processes to contribute arbitrary sparse input data vectors. Our generic
communication library, SparCML, extends MPI to support additional features,
such as non-blocking (asynchronous) operations and low-precision data
representations. As such, SparCML and its techniques will form the basis of
future highly scalable machine learning frameworks.
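The core reduction can be sketched as follows. This is a toy single-process simulation of the sparse-sum semantics, not SparCML's MPI implementation; the index-value representation, density threshold, and dense-switch heuristic are illustrative assumptions.

```python
def sparse_allreduce(contributions, density_threshold=0.25, dim=None):
    """Toy sketch of a SparCML-style sparse sum.

    Each process contributes a dict mapping gradient index -> value
    for its non-zero entries; the reduction sums values per index.
    If the merged result grows too dense, a real implementation
    switches to a dense representation, which we mimic by returning a
    plain list once the density threshold is exceeded.
    """
    merged = {}
    for contrib in contributions:
        for i, v in contrib.items():
            merged[i] = merged.get(i, 0.0) + v
    if dim is not None and len(merged) > density_threshold * dim:
        dense = [0.0] * dim              # switch to dense format
        for i, v in merged.items():
            dense[i] = v
        return dense
    return merged
```

The dense switch matters because summing sparse vectors from many processes tends to fill in indices; past a certain density, sending index-value pairs costs more than sending the dense vector.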
Marian: Fast Neural Machine Translation in C++
We present Marian, an efficient and self-contained Neural Machine Translation
framework with an integrated automatic differentiation engine based on dynamic
computation graphs. Marian is written entirely in C++. We describe the design
of the encoder-decoder framework and demonstrate that a research-friendly
toolkit can achieve high training and translation speed.
Comment: Demonstration paper
Towards an Automated Directory Information System
This paper describes a design and feasibility study for a large-scale automatic directory information system with a scalable architecture. The current demonstrator, called PADIS-XL, operates in real time and handles a database of a medium-size German city with 130,000 listings. The system uses a new technique of taking a combined decision on the joint probability over multiple dialogue turns, and a dialogue strategy that strives to restrict the search space more and more with every dialogue turn. During the course of the dialogue, the last name of the desired subscriber must be spelled out. The spelling recognizer permits continuous spelling and uses a context-free grammar to parse common spelling expressions. This paper describes the system architecture, our maximum a posteriori (MAP) decision rule, the spelling grammar, and the dialogue strategy. We give results on the SPEECHDAT and SIETILL databases on recognition of first names by spelling and on jointly deciding on the spelled and the spoken name. In a 35,000-names setup, the joint decision reduced name-recognition errors by 31%.
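The joint decision rule can be illustrated with made-up scores: rather than committing to the best name after each dialogue turn, per-name log-likelihoods from the spoken-name and spelled-name recognizers are combined before taking the argmax. The names and numbers below are hypothetical.

```python
def joint_map_decision(spoken_logp, spelled_logp):
    """Toy joint MAP decision over two dialogue turns.

    Assuming the two observations are conditionally independent given
    the name, the joint log-likelihood is the per-name sum, and the
    MAP decision is its argmax over names known to both recognizers.
    """
    names = spoken_logp.keys() & spelled_logp.keys()
    return max(names, key=lambda n: spoken_logp[n] + spelled_logp[n])

# The spoken turn slightly prefers "Maier", but the spelled turn is
# much more confident in "Mayer"; the joint decision follows the
# combined evidence.
spoken = {"Maier": -1.0, "Mayer": -1.2}
spelled = {"Maier": -3.0, "Mayer": -0.5}
print(joint_map_decision(spoken, spelled))  # Mayer
```

This is exactly the kind of case where turn-by-turn decisions err: the spoken turn alone would pick the homophone "Maier", while the joint score recovers "Mayer".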